Analyses of retweets on the topic of Lehrkräftebildung
Information about the dataset:
Code
# PREPARE DATA
import json
import pandas as pd
from src.analysis_functions import (
    get_query_info,
    get_time_range,
    metadata_prints,
    prep_data,
)

title = "Lehrkräftebildung"
topic_name = "lehrkraeftebildung"
data_path = "data/raw/all_tweets_lehrkraeftebildung.parquet"

# get information about the topic
search_words, query_conds = get_query_info(
    json_file="assets/topics.json", topic_name=topic_name
)

# load dataset
df = prep_data(data_path)

# get information about the time range
start_date, end_date = get_time_range(df, "de_DE")
metadata_prints(df, start_date, end_date, search_words, query_conds)

# drop retweets with missing usernames
try:
    missing_usernames = df.tweet_author_username.isnull().value_counts()[True]
except KeyError:
    missing_usernames = 0
df = df.dropna(subset=["tweet_author_username"])
print(
    f"\nNumber of retweets with missing usernames for the original tweeter: {missing_usernames}\n"
    f"These are being dropped from the analysis. New total of retweets: {len(df)}\n"
)
Number of total retweets in this dataset:
24689
Time range of the retweets:
February 23, 2023 - June 8, 2023
Keywords* used to collect the retweets:
(Lehrkräftebildung OR Lehrerbildung OR Lehrkräfte OR Lehrkräftefortbildung OR Seiteneinstieg OR Quereinstieg OR Lehramt)
Query conditions used to collect the retweets:
(is:retweet OR is:quote) lang:de
Number of retweets with missing usernames for the original tweeter: 277
These are being dropped from the analysis. New total of retweets: 24412
*The keywords above mean that a retweet was collected from Twitter whenever any of these keywords appeared in its text.
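To make the collection logic concrete, here is a minimal sketch of how the keywords and the query conditions printed above could be combined into a single search string. The helper `build_query` is hypothetical and is not part of `src.analysis_functions`; how the project itself assembles the query from `assets/topics.json` is not shown in this report. In Twitter's search syntax, `OR` matches any of the terms, and a space between groups acts as a logical AND.

```python
# Keywords and conditions exactly as printed in the dataset summary.
keywords = [
    "Lehrkräftebildung",
    "Lehrerbildung",
    "Lehrkräfte",
    "Lehrkräftefortbildung",
    "Seiteneinstieg",
    "Quereinstieg",
    "Lehramt",
]
conditions = "(is:retweet OR is:quote) lang:de"


def build_query(keywords, conditions):
    """Hypothetical helper: OR the keywords together, then append the conditions."""
    keyword_part = "(" + " OR ".join(keywords) + ")"
    # A space between the two groups means both must match.
    return f"{keyword_part} {conditions}"


query = build_query(keywords, conditions)
print(query)
```

Read this way, a post is collected if it contains at least one keyword, is a retweet or a quote, and is tagged as German.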
Visualizing the Network
Here are a few important things to note about the network visualization:
Every node (i.e. circle) represents a Twitter user.
The colour of the node represents the number of times that user’s tweets were retweeted. In other words, the darker the colour of the node, the more times that user was retweeted by other users. Those who only retweet and do not produce any tweets themselves are coloured bright yellow.
Similarly, the size of the node represents how many connections each node has. This includes incoming and outgoing connections, meaning both how many times they retweeted another user’s tweet and how many times their original tweets were retweeted. So both users who get retweeted very often by other users (i.e. content producers) and users who simply retweet a lot (i.e. content spreaders or even bots) will have larger sizes.
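The colour and size rules described above can be illustrated with a small sketch. It uses made-up usernames and plain retweet pairs rather than the real dataset, and it counts every retweet as one connection; whether `generate_network_graph` counts repeated retweets between the same two users once or several times is not stated here, so treat that as an assumption.

```python
from collections import Counter

# Toy retweet records as (retweeter, original_author) pairs; usernames are made up.
retweets = [
    ("anna", "ministerium"),
    ("ben", "ministerium"),
    ("carla", "ministerium"),
    ("anna", "ben"),
    ("dario", "anna"),
]

# Incoming connections: how often a user's tweets were retweeted (drives the colour).
times_retweeted = Counter(author for _, author in retweets)
# Outgoing connections: how often a user retweeted someone else.
times_retweeting = Counter(retweeter for retweeter, _ in retweets)

users = set(times_retweeted) | set(times_retweeting)

# Node size: incoming plus outgoing connections.
node_size = {u: times_retweeted[u] + times_retweeting[u] for u in users}

# Users who only retweet and are never retweeted (bright yellow in the plot).
only_retweets = {u for u in users if times_retweeted[u] == 0}

for user in sorted(users):
    print(user, "retweeted:", times_retweeted[user], "size:", node_size[user])
```

In this toy example, `ministerium` is large because it is retweeted often, while `anna` is equally large mostly because she retweets a lot, which is why size alone does not distinguish content producers from content spreaders.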
Code
from src.analysis_functions import generate_network_graph

popular_tweeters, bridging_users, active_retweeters = generate_network_graph(df, title)
Calculating the Importance of Users in the Network
Below are three measures of centrality for each user in the network:
In-degree centrality represents the number of connections going into a node.
In the case of retweets, a high in-degree centrality indicates that a user's tweets are retweeted often.
Out-degree centrality represents the number of connections going out of a node.
In the case of retweets, a high out-degree centrality indicates that a user retweets a lot.
Betweenness centrality represents the number of 'shortest paths' between pairs of other nodes that pass through a specific node.
In the case of retweets, it measures the extent to which a user connects different communities of users.
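As a rough illustration of the three measures, the sketch below computes them with `networkx` on a tiny made-up retweet graph. It assumes that edges point from the retweeter to the original author, so incoming edges mean "was retweeted"; the library and edge direction used inside `src.analysis_functions` are not shown in this report and may differ. Note that `networkx` returns normalized values (fractions) rather than raw counts.

```python
import networkx as nx

# Toy directed graph: an edge (a, b) means "a retweeted b". Usernames are made up.
edges = [
    ("anna", "ministerium"),
    ("ben", "ministerium"),
    ("ministerium", "verband"),
    ("verband", "carla"),
]
G = nx.DiGraph()
G.add_edges_from(edges)

# Share of other users that retweeted this user.
in_deg = nx.in_degree_centrality(G)
# Share of other users that this user retweeted.
out_deg = nx.out_degree_centrality(G)
# Share of shortest paths between other users that pass through this user.
betweenness = nx.betweenness_centrality(G)

for user in sorted(G.nodes):
    print(
        f"{user:12s} in={in_deg[user]:.2f} "
        f"out={out_deg[user]:.2f} betweenness={betweenness[user]:.2f}"
    )
```

Here `ministerium` scores highest on in-degree because two users retweet it, and it also scores highest on betweenness because every shortest path from `anna` or `ben` to `verband` or `carla` runs through it.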
Code
from src.analysis_functions import get_authors_name, get_top_users

authors = get_authors_name(df)
In-degree centrality / users getting a lot of retweets
Below are the 20 users with the highest in-degree centrality, i.e. the users whose tweets were retweeted most often.